Implement OpenAI token counting using tiktoken
#3447
Conversation
```python
return event_loop


def num_tokens_from_messages(
```
This is OpenAI specific so it should live in models/openai.py
```python
def num_tokens_from_messages(
    messages: list[ChatCompletionMessageParam] | list[ResponseInputItemParam],
    model: OpenAIModelName = 'gpt-4o-mini-2024-07-18',
```
We don't need a default value
```python
else:
    raise NotImplementedError(
        f"""num_tokens_from_messages() is not implemented for model {model}."""
    )  # TODO: How to handle other models?
```
Are you able to reverse engineer the right formula for gpt-5?
As long as we document that this is a best-effort calculation that may not be accurate down to the exact token, we can have one branch of logic for "everything before gpt-5" and one for everything newer. If future models have different rules, we can update the logic then.
I think with a decreased primer the calculation for gpt-5 is more accurate.
Should the method from the cookbook be the default for all other models?
```python
try:
    encoding = tiktoken.encoding_for_model(model)
except KeyError:
    print('Warning: model not found. Using o200k_base encoding.')  # TODO: How to handle warnings?
```
No warnings please, let's just make a best effort
```python
)


def num_tokens_from_messages(
```
Please make this a private function
Added `_`.
```python
elif 'gpt-5' in model:
    return num_tokens_from_messages(messages, model='gpt-5-2025-08-07')
else:
    raise NotImplementedError(f"""num_tokens_from_messages() is not implemented for model {model}.""")
```
Let's simplify all of this to `if 'gpt-5' in model: <do the new thing> else: <do the old thing>`.
So this is executed for all other models except gpt-5?

```python
if 'gpt-5' in model:
    tokens_per_message = 3
    final_primer = 2  # "reverse engineered" based on test cases
else:
    # Adapted from https://cookbook.openai.com/examples/how_to_count_tokens_with_tiktoken#6-counting-tokens-for-chat-completions-api-calls
    tokens_per_message = 3
    final_primer = 3  # every reply is primed with <|start|>assistant<|message|>
```

Or should this apply only to the models explicitly listed in the cookbook, raising an error for the others?
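For reference, a minimal sketch of what the simplified two-branch counter could look like, assuming the cookbook formula for pre-gpt-5 models and the reverse-engineered primer discussed above for gpt-5 (the fallback behavior for models unknown to tiktoken is one possible choice, not settled):

```python
import tiktoken


def _num_tokens_from_messages(messages: list[dict[str, str]], model: str) -> int:
    """Best-effort token count for a chat request; may be off by a few tokens."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        # Fall back to the newest encoding when tiktoken doesn't know the model.
        encoding = tiktoken.get_encoding('o200k_base')

    tokens_per_message = 3
    if 'gpt-5' in model:
        final_primer = 2  # "reverse engineered" from test cases, see discussion above
    else:
        # https://cookbook.openai.com/examples/how_to_count_tokens_with_tiktoken#6-counting-tokens-for-chat-completions-api-calls
        final_primer = 3  # every reply is primed with <|start|>assistant<|message|>

    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for value in message.values():
            if isinstance(value, str):
                num_tokens += len(encoding.encode(value))
    return num_tokens + final_primer
```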
```python
    'gpt-4o-2024-08-06',
}:
    tokens_per_message = 3
    final_primer = 3  # every reply is primed with <|start|>assistant<|message|>
```
Let's include a link to the doc we took this from
Link added.
```python
    'gpt-5-2025-08-07',
}:
    tokens_per_message = 3
    final_primer = 2
```
Let's make it explicit that this one was "reverse engineered"
Added a comment.
tests/models/test_openai.py
Outdated
Please test the entire exception message as here:
pydantic-ai/tests/models/test_anthropic.py
Lines 6449 to 6464 in c050590
```python
async def test_anthropic_model_usage_limit_exceeded(
    allow_model_requests: None,
    anthropic_api_key: str,
):
    model = AnthropicModel('claude-sonnet-4-5', provider=AnthropicProvider(api_key=anthropic_api_key))
    agent = Agent(model=model)
    with pytest.raises(
        UsageLimitExceeded,
        match='The next request would exceed the input_tokens_limit of 18 \\(input_tokens=19\\)',
    ):
        await agent.run(
            'The quick brown fox jumps over the lazydog.',
            usage_limits=UsageLimits(input_tokens_limit=18, count_tokens_before_request=True),
        )
```
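An OpenAI-side test could mirror that structure; the model name, limits, and exception text below are placeholders that would need to match what the implementation actually produces:

```python
async def test_openai_model_usage_limit_exceeded(
    allow_model_requests: None,
    openai_api_key: str,
):
    model = OpenAIChatModel('gpt-4o-mini', provider=OpenAIProvider(api_key=openai_api_key))
    agent = Agent(model=model)
    with pytest.raises(
        UsageLimitExceeded,
        # Placeholder numbers: match the full message a real run produces.
        match='The next request would exceed the input_tokens_limit of 18 \\(input_tokens=19\\)',
    ):
        await agent.run(
            'The quick brown fox jumps over the lazy dog.',
            usage_limits=UsageLimits(input_tokens_limit=18, count_tokens_before_request=True),
        )
```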
tests/models/test_openai.py
Outdated
This is incorrect :)
tests/models/test_openai.py
Outdated
We should also have a test for a system other than OpenAI, right? You can get that by building an OpenAIChatModel with a different provider, like OllamaProvider for example.
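A rough sketch of how such a test could be set up (the model name and base URL are illustrative assumptions, and the expected behavior for non-OpenAI systems is still to be decided):

```python
from pydantic_ai.models.openai import OpenAIChatModel
from pydantic_ai.providers.ollama import OllamaProvider

# An OpenAI-compatible model served by Ollama; tiktoken has no encoding for it,
# so the token counting should either fall back to a best-effort estimate or
# raise a clear error rather than silently returning a wrong count.
model = OpenAIChatModel('llama3.2', provider=OllamaProvider(base_url='http://localhost:11434/v1'))
```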
```python
    tokens_per_message = 3
    final_primer = 2  # "reverse engineered" based on test cases
else:
    # Adapted from https://cookbook.openai.com/examples/how_to_count_tokens_with_tiktoken#6-counting-tokens-for-chat-completions-api-calls
```
Looking at the cookbook again, I think we should also try to implement support for counting the tokens of tool definitions:
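A hedged sketch of what that could look like, loosely modeled on the cookbook's `num_tokens_for_tools` example; the per-item overhead constants below are illustrative placeholders and would need to be taken from the cookbook and validated against real API usage:

```python
import tiktoken


def _num_tokens_for_tools(tools: list[dict], encoding: tiktoken.Encoding) -> int:
    """Best-effort token estimate for tool definitions (overhead constants are placeholders)."""
    func_init, prop_init, prop_key, enum_item, func_end = 7, 3, 3, 3, 12  # placeholders, not verified

    num_tokens = 0
    for tool in tools:
        function = tool.get('function', tool)  # chat.completions nests the schema under 'function'
        num_tokens += func_init
        num_tokens += len(encoding.encode(function.get('name', '')))
        num_tokens += len(encoding.encode(function.get('description', '')))
        properties = function.get('parameters', {}).get('properties', {})
        if properties:
            num_tokens += prop_init
        for key, spec in properties.items():
            num_tokens += prop_key
            num_tokens += len(encoding.encode(key))
            num_tokens += len(encoding.encode(spec.get('type', '')))
            num_tokens += len(encoding.encode(spec.get('description', '')))
            for item in spec.get('enum', []):
                num_tokens += enum_item + len(encoding.encode(str(item)))
    return num_tokens + func_end
```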
tests/models/test_openai.py
Outdated
Instead of testing against our own hard-coded values, can we compare against the real token usage data returned by the API (that will be recorded in cassettes)?
I'd also like to see tests with more complicated message structures, e.g. with tool calls
Great idea.
Why are these files so large?
100,286 additions
Can we remove them?
The tiktoken library downloads the necessary encoding files, e.g. this one. Usually that's cached in TIKTOKEN_CACHE_DIR.
To allow the tests to run independently, the cassette for each test currently stores this file.
Is there a better way to handle these MB-sized files? Maybe put them in the cache in a fixture?
Ah that makes sense... What do you think about mocking the tiktoken calls, so we can keep these files out of the repo?
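One way the mocking could look (a sketch; the fixture name and the fake encoding's word-splitting behavior are assumptions):

```python
import pytest
import tiktoken


class _FakeEncoding:
    """Stand-in for a tiktoken Encoding: one token per whitespace-separated word."""

    def encode(self, text: str) -> list[int]:
        return [0] * len(text.split())


@pytest.fixture
def mock_tiktoken(monkeypatch: pytest.MonkeyPatch) -> None:
    # Avoid downloading the real encoding file during tests.
    monkeypatch.setattr(tiktoken, 'encoding_for_model', lambda model: _FakeEncoding())
    monkeypatch.setattr(tiktoken, 'get_encoding', lambda name: _FakeEncoding())
```

With a fake encoding the counts can no longer be compared against real API usage, so any comparison tests would still need the real encoding or pre-recorded expected values.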
```python
num_tokens = 0
for message in messages:
    num_tokens += tokens_per_message
    for value in message.values():
```
This is a bit weird, as it assumes every string value in the message dict will be sent to the model. That may be the case for ChatCompletionMessageParam, but not for ResponseInputItemParam, which is a union that includes things like Message:
```python
class Message(TypedDict, total=False):
    content: Required[ResponseInputMessageContentListParam]
    """
    A list of one or many input items to the model, containing different content
    types.
    """

    role: Required[Literal["user", "system", "developer"]]
    """The role of the message input. One of `user`, `system`, or `developer`."""

    status: Literal["in_progress", "completed", "incomplete"]
    """The status of item.

    One of `in_progress`, `completed`, or `incomplete`. Populated when items are
    returned via API.
    """

    type: Literal["message"]
    """The type of the message input. Always set to `message`."""
```

I don't think those status and type fields end up with the model. But it'd be worth verifying by comparing our calculation with real data from the API, as I suggested below in the tests.
```python
for message in messages:
    num_tokens += tokens_per_message
    for value in message.values():
        if isinstance(value, str):
```
We also don't currently handle lists of strings properly, for example ChatCompletionMessageParam can be ChatCompletionUserMessageParam:
```python
class ChatCompletionUserMessageParam(TypedDict, total=False):
    content: Required[Union[str, Iterable[ChatCompletionContentPartParam]]]
    """The contents of the user message."""

    role: Required[Literal["user"]]
    """The role of the messages author, in this case `user`."""

    name: str
    """An optional name for the participant.

    Provides the model information to differentiate between participants of the same
    role.
    """
```

content may just be a str, but could also be a list of ChatCompletionContentPartTextParam:

```python
class ChatCompletionContentPartTextParam(TypedDict, total=False):
    text: Required[str]
    """The text content."""

    type: Required[Literal["text"]]
    """The type of the content part."""
```

We shouldn't exclude that text from the count.

Same for ResponseInputItemParam, which can have text hidden inside lists.

Unfortunately OpenAI makes it very hard for us to calculate this stuff correctly, but I'd rather have no count_tokens method than one that only works in very specific unrealistic scenarios -- most users are going to have more complicated message histories than the ones we're currently accounting for. So I think we should either implement some smarter behavior (and verify in the tests that it works!), or "give up". Let me know if you're up for the challenge :)
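For what it's worth, a hedged sketch of "smarter behavior" that walks nested content instead of counting only top-level string values; the set of keys counted here (`content`, `text`, `refusal`) is an assumption that would need to be verified against real API usage:

```python
def _count_text_tokens(value: object, encoding: 'tiktoken.Encoding') -> int:
    """Recursively count tokens for string content nested inside a message param."""
    if isinstance(value, str):
        return len(encoding.encode(value))
    if isinstance(value, dict):
        # Only count fields that are plausibly sent as content, e.g. 'text' parts and nested 'content'.
        return sum(
            _count_text_tokens(v, encoding)
            for k, v in value.items()
            if k in ('content', 'text', 'refusal')
        )
    if isinstance(value, (list, tuple)):
        return sum(_count_text_tokens(item, encoding) for item in value)
    return 0
```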
Yes, I am up for trying to count the tokens for more complicated histories. It seems there are quite a few possible inputs based on your comment.
Are there any test cases which I can use as a starting point which represent a more complicated structure?
@wirthual Not specifically, but if you look at the types in the ModelRequest.parts, UserPromptPart.content and ModelResponse.parts type unions, it's pretty easy to (have AI) build one of each
@wirthual Just found https://github.com/pamelafox/openai-messages-token-helper which may be worth using or looking at for inspiration
Update doc string Co-authored-by: Douwe Maan <[email protected]>
```python
num_tokens += tokens_per_message
for value in message.values():
    if isinstance(value, str):
        num_tokens += len(encoding.encode(value))
```
Since this (or the get_encoding call further up?) could download a large file, but tiktoken is sync not async, we should wrap the call that may do a download in _utils.run_in_executor to run it in a thread
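Roughly like this, assuming `run_in_executor` takes the callable followed by its positional arguments (`model_name` stands in for whatever variable holds the model identifier):

```python
from pydantic_ai import _utils

# Resolving the encoding may download a multi-MB file, so keep it off the event loop.
try:
    encoding = await _utils.run_in_executor(tiktoken.encoding_for_model, model_name)
except KeyError:
    encoding = await _utils.run_in_executor(tiktoken.get_encoding, 'o200k_base')
```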
Work for #3430
Took the method 1:1 from the OpenAI cookbook. However, the example only covers certain models. How should other models be handled? A quick test with gpt-5 showed a token count that differed from this method's result.
gpt-5
Warning: gpt-5 may update over time. Returning num tokens assuming gpt-5-2025-08-07.
110 prompt tokens counted by num_tokens_from_messages().
109 prompt tokens counted by the OpenAI API.

Test script
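The comparison script was along these lines (a reconstruction modeled on the cookbook's comparison snippet; the actual messages behind the numbers above are not shown in the PR, so the ones below are placeholders):

```python
from openai import OpenAI

client = OpenAI()
messages = [
    {'role': 'system', 'content': 'You are a helpful assistant.'},
    {'role': 'user', 'content': 'How many tokens is this conversation?'},
]

model = 'gpt-5'
estimated = num_tokens_from_messages(messages, model=model)  # the function added in this PR
response = client.chat.completions.create(model=model, messages=messages, max_completion_tokens=1)
print(model)
print(f'{estimated} prompt tokens counted by num_tokens_from_messages().')
print(f'{response.usage.prompt_tokens} prompt tokens counted by the OpenAI API.')
```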